White Wine Quality Exploration by Jerry Wang

July 26, 2016

Summary of the Data Set

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Our dataset consists of 13 variables and 4898 observations. Wine quality has a median of 6, with a minimum of 3 and a maximum of 9. Some wines contain no citric acid, which can add ‘freshness’ and flavor to wines. Quality is the output attribute; the 11 input variables (based on physicochemical tests) could be relevant, and we will explore them in depth.

Univariate Plots Section

Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Wine quality is scored from 0 to 10, where 0 is the worst and 10 is the best. The quality histogram looks roughly normal. The best quality in the dataset is 9, most wines are scored 5 or 6, and more than 70% of wines fall in the medium quality class.
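As a quick check on these shares, the counts from the quality table can be tallied directly. This is only a minimal sketch in plain Python (the report itself shows R output); the dictionary restates the printed counts, and the variable names are illustrative rather than taken from the original analysis.

```python
# Counts per quality score, copied from the frequency table above.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

total = sum(quality_counts.values())

# Share of wines scored 5 or 6.
share_5_6 = (quality_counts[5] + quality_counts[6]) / total

# Mean quality recovered from the counts (a weighted average of the scores).
mean_quality = sum(score * n for score, n in quality_counts.items()) / total

print(total)                   # 4898
print(round(share_5_6, 3))     # 0.746
print(round(mean_quality, 3))  # 5.878
```

The recovered total and mean agree with the summary output, and scores 5 and 6 together cover roughly three quarters of the wines.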

Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The three plots above, for fixed.acidity, volatile.acidity and citric.acid, all appear roughly normally distributed with some outliers. In particular, the maximum fixed.acidity reaches 14.2.

Total Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.130   6.890   7.405   7.467   7.960  14.960
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1527 1527          14.2             0.27        0.49            1.1
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1527     0.037                  33                  156   0.992 3.15
##      sulphates alcohol quality quality.class total.acidity
## 1527      0.54    11.1       6        medium         14.96

I added a new variable, total.acidity, which sums the three acidity variables (fixed.acidity, volatile.acidity and citric.acid); its distribution also appears normal. Only one wine in the dataset has a total.acidity larger than 14, and its quality is 6. Because the winemaking conditions (time, temperature, etc.) are unknown, I don’t know what caused that.
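The derived variable can be reproduced from the printed rows. The sketch below is plain Python for illustration only; the function name `total_acidity` is hypothetical, and the inputs are the values printed for wine 1527 here and for wine 2782 in the Residual Sugar section.

```python
def total_acidity(fixed, volatile, citric):
    """Sum the three acidity measurements for a single wine."""
    return fixed + volatile + citric

# Wine 1527: fixed.acidity, volatile.acidity, citric.acid as printed above.
row_1527 = total_acidity(14.2, 0.27, 0.49)

# Wine 2782: values printed in the Residual Sugar section.
row_2782 = total_acidity(7.8, 0.965, 0.6)

print(round(row_1527, 3))  # 14.96
print(round(row_2782, 3))  # 9.365
```

Both results match the total.acidity column in the printed rows, which supports reading the variable as a plain sum of the three acidity columns.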

Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782           7.8            0.965         0.6           65.8
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 2782     0.074                   8                  160 1.03898 3.39
##      sulphates alcohol quality quality.class total.acidity
## 2782      0.69    11.7       6        medium         9.365

The distribution of residual.sugar has a long right tail. After a log10 transformation, the distribution appears bimodal, with peaks around 1.5 and 7.5. Residual sugar is the amount of sugar remaining after fermentation stops; wines normally have more than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet. Here, the minimum is 0.6 and the maximum is 65.8. The wine with a residual sugar value of 65.8 has quality 6; as with the high total.acidity value, I don’t know what caused that either.
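To see why a log10 scale helps with a long right tail, the summary values can be transformed directly. This is a rough Python sketch using only the printed summary statistics; the helper `is_sweet` and its default threshold simply encode the 45 grams/liter rule of thumb mentioned above and are not part of the original analysis.

```python
import math

# Summary values of residual.sugar (grams/liter) from the output above.
summary = {"min": 0.6, "median": 5.2, "mean": 6.391, "max": 65.8}

# The same values on a log10 scale.
log_summary = {name: math.log10(value) for name, value in summary.items()}

# On the raw scale the maximum is many times the median;
# on the log scale that ratio becomes a modest additive gap.
raw_ratio = summary["max"] / summary["median"]
log_gap = log_summary["max"] - log_summary["median"]

def is_sweet(residual_sugar, threshold=45.0):
    """Flag a wine as sweet using the 45 g/L rule of thumb."""
    return residual_sugar > threshold

for name in summary:
    print(name, summary[name], round(log_summary[name], 3))
print(round(raw_ratio, 2), round(log_gap, 3))
print(is_sweet(summary["max"]), is_sweet(summary["median"]))
```

Only the single maximum value is known from the output to exceed the sweet threshold; the sketch does not say how many other wines do.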

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides: the amount of salt in the wine. The distribution appears roughly normal; the median is 0.043 and the mean is 0.04577, very close to the median.

Sulfur Dioxide

## [1] "Summary of total.sulfur.dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The histograms for free SO2, total SO2 and the ratio of free SO2 all appear normally distributed. Since sulphates can contribute to sulfur dioxide levels, their histogram looks similar to that of total sulfur dioxide.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density has a very small range, from 0.9871 to 1.0390, very close to the density of water; the distribution is normal.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH: most wines have pH values between 3.0 and 3.4 on the pH scale (from 0, very acidic, to 14, very basic); the distribution is normal.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol percentage probably affects the density, pH level and flavor of the wine. Looking at the distributions for the different quality levels, it seems that the higher the alcohol level, the better the quality of the wine.

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wines in the dataset with 13 variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality, and index X).

Quality is the output attribute, scored from 0 to 10, where 0 is the worst and 10 is the best; originally it is an integer variable (values: 3, 4, 5, 6, 7, 8, 9). The 11 input variables (excluding X) are all numerical.

Other observations: the best quality score is 9, achieved by only 5 wines, so it is very rare. The most common quality score is 6, which is also the median.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality, which may be correlated with some of the physicochemical attributes. I’d like to find out which attributes influence the quality of white wine. I suspect alcohol and some combination of the other attributes can be used to build a predictive model of wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Acidity, residual.sugar, total.sulfur.dioxide and pH likely contribute to the quality of wines.

Did you create any new variables from existing variables in the dataset?

Yes. I created total.acidity (the sum of the three acidity variables) and a new variable, quality.class, which I will use to analyze the correlations between variables in the next two sections.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I transformed the positively skewed residual.sugar distribution with log10. The transformed distribution appears bimodal, with peaks around 1.5 and 7.5.

I also converted quality to a factor and added a new factor, quality.class (low, medium and high), so that in the Bivariate and Multivariate sections I can explore the attributes across different quality groups.
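The grouping step can be sketched as a small lookup. The report does not state the cut points used for quality.class, so the defaults below (low up to 5, medium for 6, high from 7) are assumptions chosen only to agree with the printed rows, where quality 6 is labelled medium, and with the later description of quality 3-5 as the low group. The function name is hypothetical, and the sketch is in Python for illustration.

```python
def quality_class(score, low_max=5, medium_max=6):
    """Map an integer quality score to 'low', 'medium' or 'high'.

    The cut points are ASSUMED; the original report does not state them.
    """
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    return "high"

# Class label for every score that occurs in the dataset.
labels = {score: quality_class(score) for score in range(3, 10)}
print(labels)
```

Passing different `low_max` and `medium_max` values shows how sensitive the group sizes are to the choice of cut points.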


Bivariate Plots Section

Plot Matrix

The plot matrix shows the correlation coefficient between each pair of variables. The strongest correlations with quality are with alcohol, density and chlorides (Pearson r: 0.44, -0.31, -0.21). The strongest correlations with alcohol are with density, total.sulfur.dioxide, residual.sugar and chlorides (Pearson r from -0.78 to -0.36).
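For readers unfamiliar with the coefficient reported in the plot matrix, here is a minimal Python sketch of Pearson's r. It uses only the first ten rows of alcohol and residual.sugar as printed in the structure output, so its value is illustrative and should not be expected to match the full-dataset coefficients quoted above. The function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# First ten rows only, copied from the str() output at the top of the report.
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0]
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5]

r_sample = pearson_r(alcohol, residual_sugar)
print(round(r_sample, 2))
```

Even in this tiny sample the sign is negative, in line with the direction reported for alcohol and residual.sugar.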

Quality vs Alcohol

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## wines$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

In this case, the plots show that wines in the medium and high quality.class tend to have higher alcohol values. The boxplot shows that wines with quality 6 to 9 have higher alcohol values.
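The grouped means above can be compared step by step to make the trend explicit. This Python sketch only restates the printed mean alcohol per quality score; the variable names are illustrative.

```python
# Mean alcohol by quality score, copied from the grouped summaries above.
mean_alcohol = {3: 10.34, 4: 10.15, 5: 9.809, 6: 10.58,
                7: 11.37, 8: 11.64, 9: 12.18}

scores = sorted(mean_alcohol)

# Change in mean alcohol from each quality score to the next one.
steps = {(a, b): round(mean_alcohol[b] - mean_alcohol[a], 3)
         for a, b in zip(scores, scores[1:])}

# Quality score with the lowest mean alcohol.
lowest = min(mean_alcohol, key=mean_alcohol.get)

print(lowest)  # 5
for pair, change in steps.items():
    print(pair, change)
```

The mean falls from quality 3 to 5 and rises at every step from 5 to 9, so the relationship is not monotonic across the whole quality range.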

Quality vs Density

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## wines$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

In this case, the scatterplots of density vs quality and quality.class show that wines with quality 6-9 (medium to high) tend to have lower density; the boxplot displays the same trend.

Quality vs Chlorides

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## wines$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

In this case, the quality vs chlorides scatterplot shows that wines with quality 6-9 tend to have lower chlorides; the quality.class vs chlorides scatterplot and boxplot display the same trend.

Alcohol vs Density

Alcohol and density have a negative linear relationship when we ignore the outliers.

Alcohol vs Total.Sulfur.Dioxide

In the scatterplot, total.sulfur.dioxide values are spread across all levels of alcohol. Although alcohol tends to decrease as total.sulfur.dioxide increases, the relationship is not linear.

Alcohol vs Residual.Sugar

In general, as residual.sugar increases, alcohol tends to decrease.

Alcohol vs Chlorides

As chlorides increase within the range 0-0.1, alcohol tends to decrease.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Looking at the plot matrix, the strongest correlations with quality are with alcohol, density and chlorides (Pearson r: 0.44, -0.31, -0.21).

For wines with quality in the range 6-9 (quality.class medium and high), alcohol and quality increase together. In contrast, for wines with quality in the range 3-5 (quality.class low), quality tends to decrease as alcohol increases.

Similar two-sided patterns appear for quality vs density and quality vs chlorides.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, alcohol is correlated with density, residual.sugar and chlorides. All three variables have a negative relationship with alcohol.

What was the strongest relationship you found?

My main purpose is to find which chemical properties influence the quality of wines. After comparing the relationships between quality and the relevant variables, I found that alcohol has the strongest positive relationship with wine quality.

Across the whole dataset, the strongest relationship is between residual.sugar and density, with a correlation coefficient of 0.84.


Multivariate Plots Section

Alcohol vs Density with Quality as Color

Here, the plots clearly show that higher quality wines sit on the right side, which further indicates that higher quality wines tend to have high alcohol and low density.

Alcohol vs Residual.Sugar with Quality as Color

Alcohol vs Chlorides with Quality as Color

As with Alcohol vs Density, the plots show that higher quality wines tend to have high alcohol values, low residual.sugar and low chlorides.

Linear Model

## 
## Calls:
## m1: lm(formula = alcohol ~ density, data = wines)
## m2: lm(formula = alcohol ~ density + residual.sugar, data = wines)
## m3: lm(formula = alcohol ~ density + residual.sugar + chlorides, 
##     data = wines)
## 
## =========================================================
##                       m1           m2           m3       
## ---------------------------------------------------------
##   (Intercept)      329.588***   564.755***   544.341***  
##                     (3.657)      (5.365)      (5.626)    
##   density         -320.991***  -558.645***  -537.841***  
##                     (3.679)      (5.414)      (5.684)    
##   residual.sugar                  0.167***     0.159***  
##                                  (0.003)      (0.003)    
##   chlorides                                   -4.614***  
##                                               (0.425)    
## ---------------------------------------------------------
##   R-squared             0.6          0.7          0.8    
##   adj. R-squared        0.6          0.7          0.8    
##   sigma                 0.8          0.6          0.6    
##   F                  7613.4       7302.6       5023.8    
##   p                     0.0          0.0          0.0    
##   Log-likelihood    -5668.6      -4580.9      -4522.6    
##   Deviance           2902.6       1861.6       1817.9    
##   AIC               11343.1       9169.7       9055.2    
##   BIC               11362.6       9195.7       9087.7    
##   N                  4898         4898         4898      
## =========================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The multivariate analysis further revealed that higher quality wines tend to have high alcohol, low residual.sugar and low chlorides values. Since the plots show a linear relationship between alcohol and its related variables (density, residual.sugar and chlorides), I can build a linear model and use it to predict alcohol values.

Were there any interesting or surprising interactions between features?

In the low quality group of wines, alcohol tends to decrease and chlorides tend to increase as quality increases; the medium to high quality group shows the opposite trend.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a very simple linear model starting from alcohol and density.

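The variables in the linear model account for about 80% of the variance in the alcohol value of wines (R-squared of 0.8 as rounded in the table). In the rounded table, adding residual.sugar and then chlorides each appears to improve R-squared by 0.1. However, the deviance falls much more when residual.sugar is added (2902.6 to 1861.6) than when chlorides is added (1861.6 to 1817.9), so the contribution of chlorides is considerably smaller than the rounded values suggest.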
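Because the table rounds R-squared to one decimal place, a finer value can be recovered from the printed F statistics. For an ordinary least squares model with an intercept, k predictors and n observations, F = (R²/k) / ((1 − R²)/(n − k − 1)), which rearranges to R² = kF / (kF + n − k − 1). The Python sketch below applies that identity to the table; the function name is hypothetical, and the results inherit the rounding of the printed F values.

```python
def r_squared_from_f(f_stat, n_obs, n_predictors):
    """Recover R-squared from the overall F statistic of an OLS fit
    with an intercept, using R^2 = kF / (kF + n - k - 1)."""
    df_resid = n_obs - n_predictors - 1
    k_f = n_predictors * f_stat
    return k_f / (k_f + df_resid)

N = 4898  # number of observations, from the model table

# (F statistic, number of predictors) for each model in the table.
models = {"m1": (7613.4, 1), "m2": (7302.6, 2), "m3": (5023.8, 3)}

r2 = {name: r_squared_from_f(f, N, k) for name, (f, k) in models.items()}
for name, value in r2.items():
    print(name, round(value, 3))
```

Under this identity the three models come out at roughly 0.61, 0.75 and 0.75, so most of the improvement comes from residual.sugar and the step to 0.8 in the table is largely a rounding effect.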

Alcohol is a very important property of the wines and has the strongest relationship with wine quality. Since I didn’t find a linear relationship between quality and the relevant variables, I chose alcohol as the output of the linear model. However, winemaking is a very complicated process and our dataset contains only a few physicochemical properties, so it is difficult to make this prediction more accurate.
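To make the model's limitations concrete, the m3 coefficients from the table can be applied to the two outlier wines printed earlier. This is a rough Python sketch: the coefficients are the three-decimal values shown in the table, the inputs are the printed row values, and the function name `predict_alcohol` is hypothetical. Because the density coefficient is large, small rounding in density shifts the prediction noticeably.

```python
# m3 coefficients as printed (rounded) in the model table.
INTERCEPT = 544.341
B_DENSITY = -537.841
B_RESIDUAL_SUGAR = 0.159
B_CHLORIDES = -4.614

def predict_alcohol(density, residual_sugar, chlorides):
    """Predicted alcohol from the rounded m3 coefficients."""
    return (INTERCEPT
            + B_DENSITY * density
            + B_RESIDUAL_SUGAR * residual_sugar
            + B_CHLORIDES * chlorides)

# Wine 1527 (observed alcohol 11.1) and wine 2782 (observed alcohol 11.7),
# using the values printed in the Total Acid and Residual Sugar sections.
pred_1527 = predict_alcohol(0.992, 1.1, 0.037)
pred_2782 = predict_alcohol(1.03898, 65.8, 0.074)

print(round(pred_1527, 2))  # roughly 10.81
print(round(pred_2782, 2))  # roughly -4.34
```

The first prediction is close to the observed value, while the second is negative and therefore physically impossible, which shows how poorly the linear fit extrapolates to the wine with extreme density and residual sugar.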


Final Plots and Summary

Plot One

Description One

Wine quality can be scored from 0 to 10; around 75% of wines are scored 5 or 6. There are no wines with quality less than 3 or greater than 9 in this dataset.

Plot Two

Description Two

As quality increases from 5 to 9, mean alcohol tends to increase. However, from quality 3 to 5, mean alcohol tends to decrease.

Plot Three

Description Three

As alcohol increases, density tends to decrease; there is a negative linear relationship between alcohol and density. The plot also shows that higher quality wines sit on the right side, which further illustrates that higher quality wines tend to have high alcohol and low density.


Reflection

This dataset consists of 13 variables and 4898 observations. My main purpose is to find which chemical properties influence the quality of white wines and, at the same time, to find the relationships between the other features.

First, I got to know the variables by visualizing their individual distributions and looking for unusual behavior in the histograms, and I transformed the residual.sugar distribution with log10.

Next, I used a plot matrix to calculate and plot the correlations between the variables. None of the correlations with quality is above 0.5; the strongest correlation with quality is that of alcohol. Alcohol has relatively strong correlations with density, residual.sugar and chlorides. Through bivariate visualization, I found that quality and alcohol are related in two different directions: the relationship is negative for quality 3-5 and positive for quality 5-9. Alcohol has linear relationships with density, residual.sugar and chlorides.

Finally, I explored wine quality together with alcohol, density and chlorides. Higher quality wines tend to have high alcohol, low residual.sugar and low chlorides values, so alcohol, density and chlorides appear to influence the quality of white wines most. Since the plots show a linear relationship between alcohol and its related variables (density, residual.sugar and chlorides), I can build a linear model and use it to predict alcohol values.

After doing some research, I found that winemaking is a very complicated process. The quality of wine is affected by many factors, such as grape variety, geographical location and climate, fermentation temperature and time, the physicochemical properties in our dataset, and more. If we had all of that information, I believe we could build a very good model to predict wine quality, and even use it to optimize the winemaking process.